usefulness of so-called "array processors", combined with mini- and
superminicomputers, in the analysis of computationally demanding
engineering problems, such as in finite element applications.
perform repetitive computations on well structured data sets at
effective speeds far beyond those achieved by minis and superminis.
Although their speeds are substantially below the newest generations
of large scale pipeline and parallel vector computers, they do
provide a clear cost/performance advantage. Furthermore, the
combination of a minicomputer and an array processor provides a high
degree of interactivity in addition to attractive computational
speeds.

Using an array processor/minicomputer combination effectively is a
difficult task, requiring not only knowledge of pertinent algorithms
and computer programming, but also a clear understanding of the
hardware and the details of its operation. The objective is to
produce efficient implementations of practical algorithms. These
implementations, however, should eventually be user transparent, in
order to be of benefit to the engineering community.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely
differing characteristics, are involved. One interesting factor here
is that the communication between the two processors leaves a
substantial amount of control in the hands of the programmer.
Experimentation with such a system provides an insight into the more
general area of parallel processing.

The issues here relate to the synchronization of computations and
data transfers, as well as the design of operating systems and
applications programs to handle the solution of complex problems.

Several operations typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the
speeds of execution on the host computer and the array processor,
acting as separate processors. Other measurements were designed to
assess the performance of the integrated system. In addition to
these measurements, simulation was used to predict the performance.
The simulation results are matched to the measured results in order
to gain confidence in the simulation procedures. Once this has been
accomplished, the same methods are used to predict the performance
more generally.

Measurements were taken for simple matrix operations, numerical
integration of solid element properties, and the decomposition of a
hypermatrix. In the case of the computation of stiffness matrices
for the solid element, several alternative strategies were
attempted, some of which were designed to minimize data transfers
between the host and the array processor.

Finally, the experience gained so far is used to examine the
possibility of implementing several algorithms typical of
the effectiveness of the system in execution of such computations
is made.
SUMMARY

The paper describes ongoing efforts to investigate the usefulness of
so-called "attached processors" or "array processors", combined with mini-
and superminicomputers, in the analysis of computationally demanding
engineering problems, such as in finite element nonlinear applications. A
close examination of some basic algorithms, which play an important part
in such analysis, is presented. Furthermore, a methodology is developed,
which forms a useful guide to further efforts of this nature, including
the more general case of multiprocessing. Emphasis is placed on the
ability to predict the performance of algorithms from time measurements of
certain basic operations, coupled with an understanding of the
characteristics of the hardware in question and the interplay between its
various components. Four hardware alternatives are considered, two of
which have attached array processors. Of the algorithms considered, it
appears that the decomposition process is the one which benefits most from
the presence of an array processor. The speed advantage gained in
stiffness assembly, and forward and backward substitution may both be
obtained by purchasing a more powerful superminicomputer.

INTRODUCTION 

The advent of "super computers", such as the STAR and CRAY series,
and the HEP computer [1], provides an undoubted breakthrough for large
scale finite element analysis [2]. The availability of inexpensive
microprocessors has triggered efforts to devise microcomputer networks,
using parallel and distributed processing to provide faster solution of
finite element problems [3,4]. Finite element software is moving into a
new era, in which the proper programming environment and modular program
systems [5] are replacing monstrously large programs.

In contrast to efforts based on high performance sixth generation
computers [2,6], this paper describes ongoing efforts [7] to investigate
the usefulness of so-called "attached processors" or "array processors",
combined with mini- and superminicomputers, in the analysis of
computationally demanding engineering problems, such as in finite element
nonlinear applications. The work presented here does not include actual
solutions of such problems. Rather, a close examination of some basic
algorithms, which play an important part in nonlinear analysis, is
presented. Furthermore, a methodology is developed, which forms a useful
guide to further efforts of this nature, including the more general case
of distributed processing. Emphasis is placed on the ability to predict
the performance of algorithms from time measurements of certain basic
operations, coupled with an understanding of the characteristics of the
hardware in question and the interplay between its various components.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely differing
characteristics, are present. Experimentation with such a system provides
an insight into the more general area of parallel processing. The issues
here relate to the synchronization and balancing of computations and data
transfers, as well as the design of operating systems and applications
programs to handle the solution of complex problems [8,9,10].

Measurements taken on an array processor/minicomputer system show
that, while the array processor is perhaps 80 times faster than the host
16-bit minicomputer in the performance of basic matrix arithmetic, it is
often not possible to apply this computational power to practical
problems, due to the various restrictions imposed by the limited user
address space. An alternate system is now being put together in which a
32-bit minicomputer, with a large address space and a faster data transfer
rate, is used as the host.

Several algorithms typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the speeds
of execution on the host computer and the array processor acting as
separate processors. Other measurements were designed to assess the
examine the possibility of implementing these algorithms, which we believe
are typical of nonlinear finite element analysis, on the more powerful
32-bit minicomputer system. An assessment of the effectiveness of the
alternate system in execution of such computations is made. Some caution
has to be exercised in the assessment of the performance of the 16-bit
minicomputer based system. Although some of the predicted times for large
problems may appear competitive, one should not forget that severe
restrictions are placed on problem size due to address limitations,
rendering these figures purely hypothetical, except in some dedicated
systems.


NOMENCLATURE 

It is useful to introduce some symbols and abbreviations at this
point, in order to describe alternate computer configurations and refer
to problem parameters. These abbreviations are:

HC16      Refers to 16-bit minicomputer used independently or in
          conjunction with an array processor. It is characterized by a
          limited user address space (typically 64 KB) and a slow data
          bus (typical speed up to 1.5 MB/sec).

HC32      Alternate 32-bit superminicomputer used independently or with
          an array processor. Since the array processor is assumed to
          provide the primary computational tool, the selected HC32
          system has the advantages of a 32-bit minicomputer: a large
          address space (16 MB) and a fast data bus (speed
          26.5 MB/sec). On the other hand, its floating point
          processing speed is somewhat limited (0.435 MFLOPS in
          Whetstone tests, as opposed to the HC16 figure of 0.190).
          Today's popular superminis have been measured at 1.2 MFLOPS,
          and the top of the line of our HC32 series has been measured
          at 3.5 MFLOPS.

AP        Stands for array processor. The one used in this research is
          primarily a double precision device, capable of performing 64
          bit vector arithmetic, and other well structured
          computations, at speeds well above those of mini- and
          superminicomputers. So far, however, it has been used in the
          single precision (32-bit) mode, due to address space
          limitations. Single precision variants of the same array
          processor perform the same functions faster and are
          considerably less expensive.

HC16+AP   Combination of the HC16 and AP. The devices communicate via a
          slow interface, using the HC16 low speed data bus
          extensively.

HC32+AP   Combination of HC32 and AP, which are coupled via a high
          speed direct memory interface.

U         Total number of unknowns in the problem being solved.

b         Half bandwidth of problem. Based on element connectivity.

S         Submatrix size used to partition the system of equations.

P         Total number of partitions in the stiffness matrix.

H         Number of non-zero submatrices, per row, in stiffness matrix.

M         Number of longitudinal stations in the solid model being
          analyzed.

N         Number of nodes across the solid model in either direction.

K         Total number of 8-noded isoparametric solid elements used to
          model the solids problem.
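
To make the address space figures above more concrete, the following minimal sketch counts how many square single precision submatrices would fit in a given address space. It assumes 4 bytes per 32-bit word and that the whole space is available for data, which overstates what a real program could use; the function name is illustrative only.

```python
def submatrices_that_fit(address_space_bytes, s, bytes_per_word=4):
    """Number of s-by-s submatrices that fit in the address space.

    Assumption: the entire space holds data (no program code, no
    operating system overhead), so this is an upper bound.
    """
    words = address_space_bytes // bytes_per_word
    return words // (s * s)


HC16_SPACE = 64 * 1024           # typical HC16 user address space (64 KB)
HC32_SPACE = 16 * 1024 * 1024    # HC32 address space (16 MB)

for name, space in (("HC16", HC16_SPACE), ("HC32", HC32_SPACE)):
    print(name,
          submatrices_that_fit(space, 30),   # 30X30 submatrices
          submatrices_that_fit(space, 40))   # 40X40 submatrices
```

Under these assumptions the HC16 could hold at most 18 submatrices of size 30X30 at a time, which gives a feel for why problem size is so restricted on the 16-bit host.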


HARDWARE CHARACTERISTICS

An array processor, also called an attached processor, is a high
speed special purpose computational device, which can perform repetitive
computations on well structured data sets at effective speeds far beyond
those achieved by minis and superminis. Although their speeds are
substantially below the newest generation of large scale pipeline and
parallel vector computers, they do provide a clear cost/performance
advantage. Furthermore, the combination of a minicomputer and an array
processor offers a high degree of interactivity in addition to attractive
computational speeds.


The array processor is a separate computer, running under an
involving extensive decision making. Data transfers and message swapping
between the host and the array processor are problematic and can affect
performance appreciably, as seen from some of the measurements taken. The
amount of AP local memory may be limited in certain instances. This,
however, does not apply to the system under consideration.

Several AP types are available today. We restrict the discussion to
the specific device used in this research which, we believe, represents
the state of the art. To help in understanding the operation of the
system, Fig. 1 shows the two basic components, and the various software
items which control it, as well as the location of the data. The operating
system of the host computer controls the operation of the host, including
the user program, which specifies the sequence of events necessary to
perform the computation. A special piece of software, called the DRIVER,
resides in the host computer. It facilitates communication with the array
processor by translating user FORTRAN calls into small packages called
Function Control Blocks or FCBs, which are shipped to the array processor
to initiate specific computations or data transfers on the latter device.
The format of a typical FCB, which occupies six 16-bit words (HC16) or
three 32-bit words (HC32), is given in Fig. 2. It contains a function code
and operand addresses, which take the form of array processor buffer
numbers, as well as some control variables (checksum, etc.).

On the array processor side, an independent operating system, called
the EXEC, resides. Its function is to control the operation of the array
processor itself. In the array processor there exists a mathematical
library, consisting of a number of unlinked (relocatable) routines, which
may be assembled into any program being executed inside the array
processor. One of the functions of EXEC is to link such programs, as
needed, to the specifications contained in the FCBs, and control their
execution within the array processor. Such programs take the form of two
separate, but fully synchronized, programs called the APU and APS programs
respectively. Data are stored within the array processor while being
operated upon in special areas called buffers. One important consideration
for the system at hand is that, in the FCB, a restricted number of bits is
used to store the buffer number. This poses a restriction on the number of
buffers which may be defined. At the moment, the maximum number of buffers
is 64. Soon it will be expanded to 512. The hardware configurations for
the HC16+AP and the HC32+AP are given in Fig. 3 and Fig. 4 respectively.
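
The link between the width of the buffer number field and the number of buffers can be illustrated with a short calculation. The sketch below assumes that buffers are numbered from zero and that the field holds a plain binary number; the actual field layout is the one given in Fig. 2 and is not reproduced here, and the function names are illustrative.

```python
def bits_for_buffers(n_buffers):
    """Smallest field width (in bits) able to number n_buffers buffers,
    assuming buffer numbers run from 0 to n_buffers - 1."""
    return max(1, (n_buffers - 1).bit_length())


def max_buffers(n_bits):
    """Largest number of distinct buffers an n_bits wide field can number."""
    return 1 << n_bits


print(bits_for_buffers(64))    # field width needed for the present limit
print(bits_for_buffers(512))   # field width needed for the planned limit
```

Under these assumptions the present limit of 64 buffers corresponds to a 6-bit field, and the planned limit of 512 buffers to a 9-bit field.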

BASIC MEASUREMENTS

In attempting to assess the effectiveness of a particular hardware
system, such as the one being considered here, several approaches are
possible. One approach is to run specific problems on an existing system.
counts for each basic task from which performance estimates may be made.
It is possible then to run large numbers of parametric studies, in order
to assess the effect of different problem parameters, as well as the
effectiveness of the hardware for certain algorithms. These overall
estimates are then verified by actual computations of realistic problems.
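
The estimation approach described here, operation counts multiplied by measured unit times, can be sketched as follows. The unit times are the 40X40 submatrix figures listed in Table 1 (in msecs.); the operation counts in the example are hypothetical values chosen only to exercise the function, and the function and dictionary names are illustrative rather than taken from the authors' programs.

```python
# Unit times in milliseconds for 40X40 submatrices, as listed in Table 1.
UNIT_TIME_MS = {
    "AP":   {"multiplication": 73.58, "addition": 10.00, "inversion": 112.34},
    "HC16": {"multiplication": 5448.40, "addition": 65.50, "inversion": 8039.34},
    "HC32": {"multiplication": 2724.20, "addition": 32.75, "inversion": 4019.67},
}


def estimate_seconds(counts, device):
    """Estimated run time: sum over operations of count * unit time.

    Data transfer times are deliberately left out of this sketch.
    """
    times = UNIT_TIME_MS[device]
    return sum(n * times[op] for op, n in counts.items()) / 1000.0


# Hypothetical operation counts (not from the paper).
counts = {"multiplication": 100, "addition": 100, "inversion": 10}

for device in ("HC16", "HC32", "AP"):
    print(device, round(estimate_seconds(counts, device), 2), "secs.")
```

A parametric study then amounts to recomputing the counts for each set of problem parameters and repeating the sum.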

Four systems have been considered: HC16, HC32, HC16+AP and HC32+AP.
So far the performance of the HC16 and HC16+AP has been measured. The
performance figures for the other two systems have been estimated, since
they have not been fully installed yet. It is important to note that both
host CPU and elapsed times, associated with each operation, have been
measured. However, only elapsed times are presented. By doing this, it is
implied that performance is measured by the clock time needed to run
the algorithm rather than by the CPU time, which can be deceptive. It is
also assumed that a system, such as the one examined here, is intended
primarily as a stand alone high performance system, even though it may
well be capable of time sharing and multiprocessing.

Another issue to consider is that all measurements and storage space
considerations are for single precision (32-bit) computations, even though
the array processor is a double precision system, which also has single
precision capabilities. We expect to extend the measurements later to
double precision. Many of the conclusions reached here, however, may be
generalized or extended, using reasonable assumptions. Changing to double
precision will impact primarily the HC16, since the address space is quite
limited. Execution speeds will also affect the host processor more than
the AP, since the AP is, basically, a double precision device. It is also
important to note that if single precision were adequate, less expensive
array processors are available, which would be much faster.
overhead. Figures 5.a and 5.b show plots of measured and estimated data
transfer speeds for HC16+AP.
STIFFNESS COMPUTATION AND ASSEMBLY


The process of stiffness assembly is a basic and important one for
both linear and nonlinear analysis. Here, two basic functions are
performed, element stiffness computation and assembly into the master
stiffness matrix. For the purpose of this investigation, the solid model
shown in Fig. 8 was chosen. By varying the number of stations (M) and the
number of nodes in each direction of the cross-section (N), different
At this point, it is appropriate to bring up the issue of problem
size limitations, which apply here equally for stiffness assembly and
decomposition. Fig. 9 illustrates this point. Several factors limit
problem size, and these constraints are represented by straight lines or
curves in the figure. For example, the effect of the limited number of
buffers is shown by vertical lines. At present, the system has a
this to 512, allowing 11 submatrices per row. The size of memory 1 in the
AP is currently equal to 16 KW, minus 11 KW needed for EXEC and the
mathematical library. Since in the current decomposition routine
submatrices are placed in this memory, a submatrix size of 30X30 cannot
be exceeded. If the memory is expanded to 32 KW, it is possible to employ
62X62 matrices; and if extended to 64 KW, matrices over 100X100 may be
handled. With the total space available on memories 2 and 3, a maximum
half bandwidth b, of approximately ’'20, can be handled. This is
illustrated by the first set of curves showing the possible
combinations of S and H which may be handled. If the two memories are
expanded to their full capacity, b may be increased to A ') f ’ . If 1 . r n. t • i"
arrangement and it was chosen as the working algorithm. The estimated
elapsed time for this scheme is 0.522 seconds/element for the HC16+AP and
In addition to the time needed for in-core computation and assembly,
a certain number of submatrix transfers to and from disk are required
during the assembly process. The time required for those transfers was
also computed and included in the estimate. Results from various runs show
that the stiffness computation process is speeded up by a constant factor
of approximately 5 for the HC16, and 3.5 for HC32, regardless of
bandwidth, by adding the array processor. This result is somewhat
disappointing perhaps, but similar conclusions have been reached by others
[2]. Table 2 gives the operations count and timing for a realistic
(20X12X12) problem where the number of unknowns is 8640 and the half
bandwidth is 900.
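
As a rough cross-check on the size of the (20X12X12) problem, the sketch below counts unknowns and elements for a prismatic solid model with M stations and N nodes in each cross-sectional direction. It assumes 8-noded solid elements, three unknowns per node and no unknowns removed by boundary conditions; these assumptions are ours, so the counts need not coincide exactly with the figures used by the authors. The per-element times use the element computation entries of Table 2.

```python
def solid_model_size(m_stations, n_nodes, dof_per_node=3):
    """Return (unknowns, elements) for a prismatic solid model.

    Assumptions: m_stations cross-sections of n_nodes x n_nodes nodes,
    8-noded brick elements between adjacent stations, and no unknowns
    removed by boundary conditions.
    """
    nodes = m_stations * n_nodes * n_nodes
    elements = (m_stations - 1) * (n_nodes - 1) ** 2
    return nodes * dof_per_node, elements


unknowns, elements = solid_model_size(20, 12)
print(unknowns, elements)

# Element computation times from Table 2 (secs.), divided by element count.
for system, secs in (("HC16", 7687.4), ("HC16+AP", 1200.1), ("HC32+AP", 1078.2)):
    print(system, round(secs / elements, 3), "secs./element")
```

Under these assumptions the model has 2299 elements, and the HC16+AP element computation time of Table 2 works out to about 0.52 seconds per element.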
program and the optimum algorithm, some differences in the counts were
observed. In order to remove the effect of the discrepancy, a correction
was applied, and both the corrected and uncorrected figures are given. In
both cases, the errors are well within the accepted limits, particularly
when the measured times can, themselves, vary to the same order of
magnitude. This is illustrated by one of the cases, which was run twice and
estimates. The adjusted run times are more accurate than the unadjusted
ones. Fig. 10 gives a pictorial representation of the results for a set of
decompositions where the bandwidth is kept at a constant value of 270. The
results give a certain amount of confidence in the basic approach used in
this paper and allow us to extend the estimates to other algorithms.
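
The percentage errors quoted for the decomposition runs can be reproduced with a one-line formula. The sketch below assumes that the error is taken relative to the measured time, which is our reading of Table 4 rather than a definition stated in the text; the estimated/measured pairs are the unadjusted values from that table.

```python
def percent_error(estimated, measured):
    """Absolute difference between estimate and measurement, expressed
    as a percentage of the measured time (assumed definition)."""
    return abs(measured - estimated) / measured * 100.0


# (estimated, measured) times in secs., unadjusted, read from Table 4.
pairs = [(217, 257), (155, 179), (94, 119), (63, 82), (63, 74),
         (49, 51), (42, 45), (38, 34), (33, 31)]

errors = [round(percent_error(e, m)) for e, m in pairs]
print(errors)
```

With this definition the rounded values agree with the unadjusted percentage error column of Table 4, including the case that was run twice (82 and 74 seconds against the same estimate of 63).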


FORWARD  AND  BACKWARD  SUBSTITUTIONS 


The same procedure applied so far to the stiffness assembly and
decomposition is next employed to obtain estimates of elapsed times for
the forward and backward substitutions typical of nonlinear iterations.
The results, for the same solid model described above, are given in
Table 5. They show a definite, yet small, advantage for the HC32+AP system
over the HC32. One problem seems to be the ineffectiveness of simple
additions inside the array processor. In that case, it might be
advantageous to perform this part of the computation inside the host
computer. This would speed up the process somewhat, giving a total speed
up factor of about 5 over the HC32 alone. At this point in time, it appears
that the advantage of the array processor is not as substantial here.
However, careful measurements are required before final conclusions are
made.
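
The effect of moving the additions back to the host can be estimated directly from the entries of Table 5. The sketch below is only a back-of-the-envelope calculation: it takes the HC32+AP component times, replaces the array processor addition time by the HC32 addition time, and ignores any extra data transfers that such a split might require, so the result is optimistic and need not match the authors' own figure.

```python
# Component times in secs. for the HC32+AP system, from Table 5.
ap_run = {
    "transfers": 140.5,
    "multiplication": 64.9,
    "addition": 83.3,
    "fcb_transfer": 49.0,
}
hc32_alone_total = 1438.2   # HC32 without array processor (Table 5)
host_addition = 15.5        # additions performed on the HC32 (Table 5)

baseline = sum(ap_run.values())
# Hypothetical variant: additions done on the host instead of the AP.
modified = baseline - ap_run["addition"] + host_addition

print(round(baseline, 1), round(hc32_alone_total / baseline, 2))
print(round(modified, 1), round(hc32_alone_total / modified, 2))
```

The first line reproduces the total and speed up factor listed for the HC32+AP; the second gives the corresponding figures when additions are left on the host.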


ON  THE  ADVANTAGES  OF  PARALLEL  COMPUTATION 


In a system incorporating multiple asynchronous devices, the question
comes to mind as to the possible advantages of parallel computation.
Whether an algorithm can be accelerated by allowing the different
hardware modules to operate independently, performing different but
related functions at the same time, depends on both the algorithm and the
hardware configuration [7]. Fig. 11 shows a hypothetical case for which
parallel processing is to be applied to a specific algorithm. Three
devices are considered, a host computer, a data transfer bus, and an array
processor. Fig. 11.a illustrates the sequence of operations, in real time,
for the given system, assuming serial computation. The same process may be
represented more clearly in Fig. 11.b, where it is evident that the array
processor is not being used sufficiently. The host and data bus, on the
other hand, are used approximately half the time. The maximum benefit to
be derived from parallelism, in this case, is shown in Fig. 11.c, which
assumes that all devices can perform the required functions
simultaneously. This is usually not possible, however, since operations
conducted in one part of the hardware usually require that other
operations conducted elsewhere be completed first, which results in an
actual run such as that shown in Fig. 11.d.

Whether a particular algorithm can benefit from parallel computation
or not must first be determined by a simple analysis of the demand on
computer resources. If the largest amount of time is spent in one
particular device, parallel operation will most certainly not help. It
might be possible to shift some of the load from one device to another, in
order to obtain a more balanced load on the system. If, for example, an
algorithm is compute bound, such as in the case of the decomposition, it
is possible to increase the speed by supplying more than one array
processor, or CPU, in order to achieve a balance between computation and
data transfer times. On the other hand, if the process suffers from
excessive data transfer times, data buffering, faster data busses, along
with multiple disk drives and controllers may be used. Careful planning of
system I/O operation would also help. Once it is determined that an
algorithm is potentially suitable for parallel computation, detailed
analysis in which the discrete nature of the demands on each system
component is taken into consideration, is worth undertaking [?].
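
The simple analysis of resource demand described above can be sketched with a two-resource model: time spent on data transfers and time spent computing. Serial operation takes the sum of the two, while ideal overlap takes the larger of the two. This is a simplification of the three-device picture of Fig. 11, and grouping the FCB transfers with the data transfers is our assumption; with that grouping the model reproduces the "Parallel" rows of Tables 3 and 5 for the HC32+AP.

```python
def overlap_bounds(transfer, compute):
    """Serial time, ideal overlapped time, bottleneck and maximum gain
    for a two-resource model (assumed: perfect overlap, no dependencies)."""
    serial = transfer + compute
    overlapped = max(transfer, compute)
    bottleneck = "transfer" if transfer > compute else "compute"
    return serial, overlapped, bottleneck, serial / overlapped


# HC32+AP, forward and backward substitution (Table 5, secs.).
subst = overlap_bounds(transfer=140.5 + 49.0, compute=64.9 + 83.3)

# HC32+AP, decomposition (Table 3, secs.).
decomp = overlap_bounds(transfer=71.5 + 61.9, compute=7144.0 + 143.3 + 51.9)

for name, (serial, overlapped, bottleneck, gain) in (("substitution", subst),
                                                     ("decomposition", decomp)):
    print(name, round(serial, 1), round(overlapped, 1), bottleneck, round(gain, 3))
```

The decomposition is strongly compute bound, so overlapping transfers gains under 2 percent, whereas the substitutions are transfer bound and could gain a factor of about 1.8 at best.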

CONCLUSIONS

The paper has succeeded in devising a methodology for predicting the
performance improvement as a result of the addition of an array processor
to a minicomputer. Four systems were considered, two of which have
attached array processors. Of the algorithms considered, it appears that
the decomposition process is the one which benefits most from the presence
of an array processor. The speed advantage gained in stiffness assembly,
and forward and backward substitutions may both be obtained by purchasing
a more powerful superminicomputer. As such, it appears that a mix of jobs,
in which the host computer is freed to handle other tasks such as pre- and
postprocessing along with linearized iterations while the array processor
is occupied in the decomposition process, might be advantageous. One might
think of alternatives to traditional computational strategies. For
example, in the modified Newton method, the host computer may be used to
improve on a current solution vector based on a previously obtained
decomposition, while the array processor is used to compute and decompose
a new stiffness matrix based on a more recent solution point. Using the
current time estimates, the advantage gained may be small. Only 6
iterations can be performed in the time required by the HC32+AP to compute
a new stiffness and decompose it. Sparse matrix computations may be used
for the iterative procedure, rather than a direct solution. It is too
early to find out whether such algorithms could be effective on an array
processor.
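
The number of host iterations that fit into one stiffness update can be estimated from the tabulated totals. The sketch below assumes that one modified Newton iteration on the host costs one forward and backward substitution on the HC32 alone (Table 5), and that a stiffness update on the HC32+AP costs the assembly total of Table 2 plus the decomposition total of Table 3; other costs per iteration are ignored.

```python
# Totals in secs., read from Tables 2, 3 and 5.
assembly_hc32_ap = 1114.0        # stiffness computation and assembly
decomposition_hc32_ap = 7472.7   # matrix decomposition
substitution_hc32 = 1438.2       # forward and backward substitution, host only

update_time = assembly_hc32_ap + decomposition_hc32_ap
iterations = update_time / substitution_hc32

print(round(update_time, 1), round(iterations, 2))
```

Under these assumptions roughly six host iterations fit into the time of one stiffness update, which illustrates why the gain from this form of overlap is expected to be modest.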


One must state, however, that the apparent advantage of the array
processor in the decomposition algorithm, which seems to be the most time
consuming of all computational steps, is not to be taken lightly. It is
clear that such speeds cannot be achieved with simple computer designs
within the same price range.

Many other conclusions come to mind. A 16-bit minicomputer with
limited address space does not appear to be suitable as a host for a fast
array processor. The restricted memory size drastically limits problem
size. A conventional communication interface is not adequate if the
minicomputer/array processor combination is to be utilized effectively. In
using the hypermatrix scheme, small submatrices are to be avoided. A
minimum size of 30X30 is recommended. For large problems, stiffness
computation and assembly is speeded up by a factor of 3.5, forward and
backward substitution is speeded up by a factor of 4 to 5, but matrix
decomposition may run as much as 40 times faster. Parallel processing does
not seem to offer much of an advantage, unless multiple array processors
are used or a mix of jobs is carried on simultaneously by the system. The
manufacturing of specialized hardware to perform certain functions
effectively is a valid idea and will play an increasingly greater role.
Future hardware will probably take the form of a network of such systems.
ITEM                  AP       HC16       HC32

HC-DK                         55.33       9.40
HC-AP                         45.87      16.10
Multiplication      73.58   5448.40    2724.20
Addition            10.00     65.50      32.75
Inversion          112.34   8039.34    4019.67
MRB transfer                   8.50       1.00
FCB transfer                   8.50       1.00

Table 1. TIMING OF BASIC OPERATIONS FOR (40X40 SUBMATRICES)
(Times in msecs.)


ITEM             HC16     HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                            FACTOR              FACTOR

Transfers       237.7     35.7     237.7     1.000      35.7     1.000
Elmt. Comp.    7687.4   3843.0    1200.1     6.406    1078.2     3.565
Total          7925.6   3879.7    1437.8     5.512    1114.0     3.483
Parallel       7687.9   3843.9    1200.1     6.406    1078.2     3.565

NOTE: HC16 times are hypothetical due to memory restrictions.

Table 2. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(STIFFNESS ASSEMBLY) (Times in secs.)


ITEM                 HC16       HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                                  FACTOR              FACTOR

Transfers           475.5       71.5     838.6     0.567      71.5     1.000
Multiplication   555952.1   277976.0    7144.0    77.821    7144.0    38.910
Addition           2085.4     1042.7     143.3    14.550     143.3     7.275
Inversion          3906.4     1953.2      51.9    75.271      51.9    37.635
FCB transfer                             526.4                61.9
Total            562419.3   281043.3    8704.3    64.614    7472.7    37.609
Parallel         561943.8   280971.8    7339.2    76.567    7339.2    38.283

Table 3. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(DECOMPOSITION) (Times in secs.)


No. of   Submatrix   N.S.M/row                Estimated   Measured   Pcnt            Pcnt
Prtns.   size S      H            U     b/U   (prog.)     (run)      err     Adj.    err

  40       30          9       1200   0.225      217        257       16     2 a a     7
  30       30          9        900   0.3        155        179       13     163       5
  20       30          9        600   0.45        94        119       21     101      15
  15       30          9        450   0.60        63         82       23      -        -
  15       30          9        450   0.60        63         74       15      -        -
  15       24          9        360   0.60        49         51        4      -        -
  15       20          9        300   0.60        42         45        7      -        -
  15       16          9        240   0.60        38         34       12      -        -
  15       10          9        150   0.60        33         31        6      -        -

Table 4. COMPARISON OF MEASURED AND PREDICTED TIMES
(MATRIX DECOMPOSITION) (Times in secs.)


ITEM               HC16     HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                              FACTOR              FACTOR

Transfers         934.3    140.5    1647.8      .567     140.5     1.000
Multiplication   2564.6   1282.3      64.9    39.540      64.9    19.770
Addition           30.9     15.5      83.3      .371      83.3      .185
Inversion            .0       .0        .0      .000        .0      .000
FCB transfer                         416.7                49.0
TOTAL            3529.8   1438.2    2212.7     1.595     337.7     4.259
PARALLEL         2595.5   1297.8    2064.5     1.257     189.5     6.848

Table 5. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(FORWARD AND BACKWARD SUBSTITUTIONS) (Times in secs.)